Deep belief networks are a powerful way to model complex probability
distributions. However, learning the structure of a belief network,
particularly one with hidden units, is difficult. The Indian buffet
process has been used as a nonparametric Bayesian prior on the
directed structure of a belief network with a single infinitely wide
hidden layer. In this paper, we introduce the cascading Indian buffet
process (CIBP), which provides a nonparametric prior on the structure
of a layered, directed belief network that is unbounded in both
depth and width, yet allows tractable inference. We use the CIBP
prior with the nonlinear Gaussian belief network so each unit can
additionally vary its behavior between discrete and continuous
representations. We provide Markov chain Monte Carlo algorithms for
inference in these belief networks and explore the structures learned
on several image data sets.



1. Introduction. The belief network or directed probabilistic graphical 
model [Pearl, 1988] is a popular and useful way to represent complex prob- 
ability distributions. Methods for learning the parameters of such networks 
are well-established. Learning network structure, however, is more difficult, 
particularly when the network includes unobserved hidden units. Then, not 
only must the structure (edges) be determined, but the number of hidden 
units must also be inferred. This paper contributes a novel nonparametric 
Bayesian perspective on the general problem of learning graphical models 
with hidden variables. Nonparametric Bayesian approaches to this problem 
are appealing because they can avoid the difficult computations required 
for selecting the appropriate a posteriori dimensionality of the model. In- 
stead, they introduce an infinite number of parameters into the model a pri- 
ori and inference determines the subset of these that actually contributed 
to the observations. The Indian buffet process (IBP) [Ghahramani et al., 
2007, Griffiths and Ghahramani, 2006] is one example of a nonparamet- 
ric Bayesian prior and it has previously been used to introduce an infi- 
nite number of hidden units into a belief network with a single hidden 
layer [Wood et al., 2006]. 



*http://www.cs.toronto.edu/~rpa






R.P. ADAMS ET AL. 



This paper unites two important areas of research: nonparametric Baye- 
sian methods and deep belief networks. To date, work on deep belief net- 
works has not addressed the general structure-learning problem. We there- 
fore present a unifying framework for solving this problem using nonpara- 
metric Bayesian methods. We first propose a novel extension to the Indian 
buffet process — the cascading Indian buffet process (CIBP) — and use 
the Foster-Lyapunov criterion to prove convergence properties that make it 
tractable with finite computation. We then use the CIBP to generalize the 
single-layered, IBP-based, directed belief network to construct multi-layered 
networks that are both infinitely wide and infinitely deep, and discuss use- 
ful properties of such networks including expected in-degree and out-degree 
for individual units. Finally, we combine this framework with the powerful 
continuous sigmoidal belief network framework [Frey, 1997]. This allows us 
to infer the type (i.e., discrete or continuous) of individual hidden units — an 
important property that is not widely discussed in previous work. To sum- 
marize, we present a flexible, nonparametric framework for directed deep 
belief networks that permits inference of the number of hidden units, the 
directed edge structure between units, the depth of the network and the 
most appropriate type for each unit. 

2. Finite Belief Networks. We consider belief networks that are lay- 
ered directed acyclic graphs with both visible and hidden units. Hidden 
units are random variables that appear in the joint distribution described 
by the belief network but are not observed. We index layers by m, increasing 
with depth up to M, and allow visible units (i.e., observed variables) only in 
layer m = 0. We require that units in layer m have parents only in layer m+1. 
Within layer m, we denote the number of units as $K^{(m)}$ and index the
units with k so that the kth unit in layer m is denoted $u_k^{(m)}$. We use
the notation $u^{(m)}$ to refer to the vector of all $K^{(m)}$ units for
layer m together. A binary $K^{(m-1)} \times K^{(m)}$ matrix $Z^{(m)}$
specifies the edges from layer m to layer m-1, so that element
$Z^{(m)}_{k,k'} = 1$ iff there is an edge from unit $u_{k'}^{(m)}$ to
unit $u_k^{(m-1)}$.

A unit's activation is determined by a weighted sum of its parent units.
The weights for layer m are denoted by a $K^{(m-1)} \times K^{(m)}$
real-valued matrix $W^{(m)}$, so that the activations for the units in
layer m can be written as
$y^{(m)} = (W^{(m+1)} \odot Z^{(m+1)})\, u^{(m+1)} + \gamma^{(m)}$, where
$\gamma^{(m)}$ is a $K^{(m)}$-dimensional vector of bias weights and the
binary operator $\odot$ indicates the Hadamard (elementwise) product.
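As a minimal sketch of the activation computation described above, the masked weighted sum can be written in a few lines. NumPy, the function name `layer_activations`, and the toy values are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def layer_activations(W, Z, u_parent, gamma):
    """Return y = (W * Z) @ u_parent + gamma for one layer.

    W, Z     : (K_child, K_parent) weights and binary edge mask
    u_parent : (K_parent,) states of the units in the layer above
    gamma    : (K_child,) bias weights
    """
    W = np.asarray(W, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Elementwise (Hadamard) product removes weights on absent edges.
    return (W * Z) @ np.asarray(u_parent, dtype=float) + np.asarray(gamma, dtype=float)

# Hypothetical example: 2 child units, 3 parent units.
W = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.0, -0.5]])
Z = np.array([[1, 0, 1],
              [0, 1, 1]])
u_parent = np.array([0.2, -0.4, 0.6])
gamma = np.array([0.1, -0.1])
y = layer_activations(W, Z, u_parent, gamma)
```

Masking with `Z` rather than zeroing entries of `W` keeps the weight of a currently absent edge available if that edge is later re-introduced; whether that is desirable depends on the sampler.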

To achieve a wide range of possible behaviors for the units, we use the
nonlinear Gaussian belief network (NLGBN) [Frey, 1997, Frey and Hinton,
1999] framework. In the NLGBN, the distribution on $u_k^{(m)}$ arises from
adding zero mean Gaussian noise with precision $\nu_k^{(m)}$ to the
activation sum $y_k^{(m)}$. This noisy sum is then transformed with a
sigmoid function $\sigma(\cdot)$ to arrive at the value of the unit. We
modify the NLGBN slightly so that the sigmoid function is from the real
line to (-1,1), i.e. $\sigma : \mathbb{R} \to (-1,1)$, via
$\sigma(x) = 2/(1 + \exp\{-x\}) - 1$. The distribution of $u_k^{(m)}$ given
its parents is then

$p(u_k^{(m)} \mid y_k^{(m)}, \nu_k^{(m)}) = \frac{\exp\left\{-\frac{\nu_k^{(m)}}{2}\left[\sigma^{-1}(u_k^{(m)}) - y_k^{(m)}\right]^2\right\}}{\sigma'(\sigma^{-1}(u_k^{(m)}))\,\sqrt{2\pi/\nu_k^{(m)}}}$,

where $\sigma'(x) = \frac{d}{dx}\sigma(x)$. As discussed in Frey [1997] and
shown in Figure 1, different choices of $\nu_k^{(m)}$ yield different
belief unit behaviors from effectively discrete binary units to nonlinear
continuous units. In the multilayered construction we have described
here, the joint distribution over the units in a NLGBN is

(1) $p(\{u^{(m)}\}_{m=0}^{M} \mid \{Z^{(m)}, W^{(m)}\}_{m=1}^{M}, \{\gamma^{(m)}, \nu^{(m)}\}_{m=0}^{M}) = \prod_{k=1}^{K^{(M)}} p(u_k^{(M)} \mid \gamma_k^{(M)}, \nu_k^{(M)}) \prod_{m=0}^{M-1} \prod_{k=1}^{K^{(m)}} p(u_k^{(m)} \mid y_k^{(m)}, \nu_k^{(m)})$.
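The unit distribution above can be sketched as follows, assuming the sigmoid is the increasing map $\sigma(x) = 2/(1+e^{-x}) - 1 = \tanh(x/2)$, so that $\sigma'(x) = (1 - \sigma(x)^2)/2$. NumPy and all function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    # Maps the real line to (-1, 1); equal to tanh(x / 2).
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def sigmoid_inv(u):
    # Inverse of sigmoid on (-1, 1).
    return np.log((1.0 + u) / (1.0 - u))

def sample_unit(y, nu, rng):
    # Add zero-mean Gaussian noise with precision nu, then squash.
    noise = rng.normal(0.0, 1.0 / np.sqrt(nu), size=np.shape(y))
    return sigmoid(y + noise)

def unit_log_density(u, y, nu):
    # Change of variables: Gaussian log-density of sigmoid_inv(u),
    # minus log sigma'(sigmoid_inv(u)) = log((1 - u^2) / 2).
    x = sigmoid_inv(u)
    log_gauss = 0.5 * np.log(nu / (2.0 * np.pi)) - 0.5 * nu * (x - y) ** 2
    log_jacobian = np.log(0.5 * (1.0 - u ** 2))
    return log_gauss - log_jacobian
```

Small `nu` pushes most samples toward the ends of (-1, 1), which gives the effectively binary behavior, while large `nu` makes the unit close to a deterministic function of its activation.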



3. Infinite Belief Networks. Conditioned on the number of layers M,
the layer widths $K^{(m)}$ and the network structures $Z^{(m)}$, inference
in belief networks can be straightforwardly implemented using Markov chain
Monte Carlo [Neal, 1992]. Learning the depth, width and structure, however,
presents significant computational challenges. In this section, we present a 
novel nonparametric prior, the cascading Indian buffet process, for multi- 
layered belief networks that are both infinitely wide and infinitely deep. By 
using an infinite prior we avoid the need for the complex dimensionality- 
altering proposals that would otherwise be required during inference. 

3.1. The Indian buffet process. Section 2 used the binary matrix $Z^{(m)}$
as a convenient way to represent the edges connecting layer m to layer m-1.
We stated that $Z^{(m)}$ was a finite $K^{(m-1)} \times K^{(m)}$ matrix. We
can use the Indian buffet process (IBP) [Griffiths and Ghahramani, 2006] to
allow this matrix to have an infinite number of columns. We assume the
two-parameter IBP [Ghahramani et al., 2007], and use
$Z^{(m)} \sim \mathrm{IBP}(\alpha, \beta)$ to indicate that the matrix
$Z^{(m)} \in \{0,1\}^{K^{(m-1)} \times \infty}$ is drawn from an IBP with
parameters $\alpha, \beta > 0$.
Fig 1: Three modes of operation for the NLGBN unit, with panels (a) small
$\nu$, (b) $\nu = 5$ and (c) $\nu = 1000$. The black solid line shows
the zero mean distribution (i.e. y = 0), the red dashed line shows a
pre-sigmoid mean of +1 and the blue dash-dot line shows a pre-sigmoid mean
of -1. (a) Binary behavior from small precision. (b) Roughly Gaussian
behavior from medium precision. (c) Deterministic behavior from large
precision.

The eponymous metaphor for the IBP is a restaurant with an infinite num- 
ber of dishes available. Each customer chooses a finite set of dishes to taste. 
The rows of the binary matrix correspond to customers and the columns 
correspond to dishes. If the jth customer tastes the kth dish, then
$Z_{j,k} = 1$, otherwise $Z_{j,k} = 0$. The first customer into the
restaurant samples a number of dishes that is Poisson distributed with
parameter $\alpha$. After that, when the jth customer enters the
restaurant, she selects dish k with probability
$\eta_k/(j + \beta - 1)$, where $\eta_k$ is the number of previous
customers that have tried the kth dish. She then chooses a number of
additional dishes to taste that is Poisson distributed with parameter
$\alpha\beta/(j + \beta - 1)$. Even though each customer chooses dishes
based on their popularity with previous customers, the rows and columns of
the resulting matrix are infinitely exchangeable.
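The restaurant metaphor translates directly into a generative procedure. The following is a minimal sketch, assuming NumPy and a hypothetical function name `sample_ibp`; it returns only the nonzero columns of the (conceptually infinite) matrix.

```python
import numpy as np

def sample_ibp(num_customers, alpha, beta, rng):
    """Draw a binary matrix from the two-parameter IBP.

    Rows are customers and columns are dishes that were tasted
    at least once.
    """
    columns = []  # one list of 0/1 entries per dish
    for j in range(1, num_customers + 1):
        # Existing dishes: taste dish k with prob. eta_k / (j + beta - 1).
        for col in columns:
            eta_k = sum(col)
            col.append(int(rng.random() < eta_k / (j + beta - 1.0)))
        # New dishes: Poisson(alpha * beta / (j + beta - 1)),
        # which reduces to Poisson(alpha) for the first customer.
        for _ in range(rng.poisson(alpha * beta / (j + beta - 1.0))):
            columns.append([0] * (j - 1) + [1])
    if not columns:
        return np.zeros((num_customers, 0), dtype=int)
    return np.array(columns, dtype=int).T

# Hypothetical usage: 10 customers, alpha = 3, beta = 1.
Z = sample_ibp(10, 3.0, 1.0, np.random.default_rng(0))
```

Because every new column is created with a one in the current row, the returned matrix never contains an all-zero column.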

As in Wood et al. [2006] , if the model of Section 2 had only a single hidden 
layer, i.e. M = 1, then the IBP could be used to make that layer infinitely 
wide. While a belief network with an infinitely-wide hidden layer can rep- 
resent any probability distribution arbitrarily closely [Le Roux and Bengio, 
2008], it is not necessarily a useful prior on such distributions. Without 
intra-layer connections, the hidden units are independent a priori. This
"shallowness" is a strong assumption that weakens the model in practice 
and the explosion of recent literature on deep belief networks (see, e.g. 
Hinton and Salakhutdinov [2006], Hinton et al. [2006]) speaks to the em- 
pirical success of belief networks with more hidden structure. 

3.2. The cascading Indian buffet process. To build a prior on belief
networks that are unbounded in both width and depth, we use an IBP-like
object that provides an infinite sequence of binary matrices
$Z^{(0)}, Z^{(1)}, Z^{(2)}, \ldots$. We require the matrices in this
sequence to inherit the useful sparsity properties of the IBP, with the
constraint that the columns from $Z^{(m-1)}$ correspond to the rows in
$Z^{(m)}$. We interpret each matrix $Z^{(m)}$ as specifying the directed
edge structure from layer m to layer m-1, where both layers have a
potentially unbounded width.

We propose the cascading Indian buffet process to provide a prior with
these properties. The CIBP extends the vanilla IBP in the following way:
each of the "dishes" in the restaurant is also a "customer" in another
Indian buffet process. The columns in one binary matrix correspond to the
rows in another binary matrix. The CIBP is infinitely exchangeable in the
rows of matrix $Z^{(0)}$. Each of the IBPs in the recursion is exchangeable
in its rows and columns, so it does not change the probability of the data
to propagate a permutation back through the matrices.

If there are $K^{(0)}$ customers in the first restaurant, a surprising
result is that, for finite $K^{(0)}$, $\alpha$, and $\beta$, the CIBP
recursion terminates with probability one. By "terminate" we mean that at
some point the customers do not taste any dishes and all deeper restaurants
have neither dishes nor customers. Here we only sketch the intuition behind
this result. A proof is provided in Appendix A.

The matrices in the CIBP are constructed in a sequence, starting with
m = 0. The number of nonzero columns in matrix $Z^{(m+1)}$, $K^{(m+1)}$, is
determined entirely by $K^{(m)}$, the number of active nonzero columns in
$Z^{(m)}$. We require that for some matrix $Z^{(m)}$, there are no nonzero
columns. For this purpose, we can disregard the fact that it is a
matrix-valued stochastic process and instead consider the Markov chain that
results on the number of nonzero columns. Figure 2a shows three traces of
such a Markov chain on $K^{(m)}$. If we define
$\lambda(K; \alpha, \beta) = \alpha \sum_{k'=1}^{K} \frac{\beta}{k' + \beta - 1}$,
then the Markov chain has the transition distribution

(2) $p(K^{(m+1)} = k \mid K^{(m)}, \alpha, \beta) = \frac{1}{k!} \exp\{-\lambda(K^{(m)}; \alpha, \beta)\}\, \lambda(K^{(m)}; \alpha, \beta)^k$,

which is simply a Poisson distribution with mean
$\lambda(K^{(m)}; \alpha, \beta)$. Clearly, $K^{(m)} = 0$ is an absorbing
state; however, the state space of the Markov chain is countably-infinite
and to know that it will reach the absorbing state with probability one, we
must know that $K^{(m)}$ does not blow up to infinity.
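The Markov chain on layer widths is easy to simulate, which gives a quick way to see the termination behavior for particular parameter values. This is a sketch under the Poisson transition described above; NumPy, the function names, and the `max_depth` safety cap are assumptions introduced for illustration.

```python
import numpy as np

def lam(K, alpha, beta):
    # lambda(K; alpha, beta) = alpha * sum_{k'=1}^{K} beta / (k' + beta - 1)
    k = np.arange(1, K + 1)
    return alpha * np.sum(beta / (k + beta - 1.0))

def width_trace(K0, alpha, beta, rng, max_depth=10_000):
    """Simulate K^(0), K^(1), ... until the absorbing state 0 is hit.

    max_depth is a hypothetical cap so the loop always ends, even for
    parameter values where absorption is slow.
    """
    widths = [K0]
    while widths[-1] > 0 and len(widths) <= max_depth:
        widths.append(int(rng.poisson(lam(widths[-1], alpha, beta))))
    return widths

# Hypothetical usage: start from 50 units with alpha = 1, beta = 1.
trace = width_trace(50, 1.0, 1.0, np.random.default_rng(0))
```

Larger values of `alpha` keep the chain away from zero for longer, so traces can be long even though absorption still occurs with probability one.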

In such a Markov chain, this requirement is equivalent to the statement
that the chain has an equilibrium distribution when conditioned on
nonabsorption (has a quasi-stationary distribution) [Seneta and
Vere-Jones, 1966]. For countably-infinite state spaces, a Markov chain has
a (quasi-)stationary distribution if it is positive-recurrent, which is the
property that there is a finite expected time between consecutive visits to
any state. Positive recurrency can be shown by proving the Foster-Lyapunov
stability criterion (FLSC) [Fayolle et al., 2008]. Taken together,
satisfying the FLSC for the Markov chain with transition probabilities
given by Eqn 2 demonstrates that eventually the CIBP will reach a
restaurant in which the customers try no new dishes. We do this by showing
that if $K^{(m)}$ is large enough, the expected $K^{(m+1)}$ is smaller than
$K^{(m)}$.

Fig 2: Properties of the Markov chain on layer width for the CIBP, with
$\alpha = 3$, $\beta = 1$. Note that these values are illustrative and are
not necessarily appropriate for a network structure. (a) Example traces of
a Markov chain on layer width, indexed by depth m, with $K^{(0)} = 50$.
(b) Expected $K^{(m+1)}$ as a function of the current width $K^{(m)}$ is
shown in blue. The Lyapunov function $\mathcal{L}(\cdot)$ is shown in
green. (c) The drift as a function of the current width $K^{(m)}$. This
corresponds to the difference between the two lines in (b). Note that it
goes negative when the layer width is greater than eight.

The FLSC requires a Lyapunov function
$\mathcal{L}(k) : \mathbb{N}^+ \to \mathbb{R}_{\geq 0}$, with which we
define the drift function:

$\mathbb{E}_{k \mid K^{(m)}}\left[\mathcal{L}(k) - \mathcal{L}(K^{(m)})\right] = \sum_{k=1}^{\infty} p(K^{(m+1)} = k \mid K^{(m)})\,(\mathcal{L}(k) - \mathcal{L}(K^{(m)}))$.

The drift is the expected change in $\mathcal{L}(k)$. If there is a
$K^{(m)}$ above which all drifts are negative, then the Markov chain
satisfies the FLSC and is positive-recurrent. In the CIBP, this is
satisfied for $\mathcal{L}(k) = k$. That the drift eventually becomes
negative can be seen by the fact that
$\mathbb{E}_{k \mid K^{(m)}}[\mathcal{L}(k)] = \lambda(K^{(m)}; \alpha, \beta)$
is $O(\ln K^{(m)})$ and
$\mathbb{E}_{k \mid K^{(m)}}[\mathcal{L}(K^{(m)})] = K^{(m)}$ is
$O(K^{(m)})$. Figures 2b and 2c show a schematic of this idea.

Fig 3: Samples from the CIBP-based prior on network structures, with five
visible units. Panels correspond to different settings of $\alpha$ and
$\beta$ (e.g., panel (d): $\alpha = 1$, $\beta = 2$).
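The drift argument can be checked numerically for specific parameter values. The sketch below assumes the Lyapunov function $\mathcal{L}(k) = k$, so the drift at width K is the Poisson mean minus K; NumPy and the function names are illustrative assumptions.

```python
import numpy as np

def expected_next_width(K, alpha, beta):
    # Poisson mean: alpha * sum_{k'=1}^{K} beta / (k' + beta - 1)
    k = np.arange(1, K + 1)
    return alpha * np.sum(beta / (k + beta - 1.0))

def drift(K, alpha, beta):
    # With L(k) = k, the drift is E[K^(m+1) | K^(m) = K] - K.
    return expected_next_width(K, alpha, beta) - K

def last_nonnegative_drift(alpha, beta, K_max=1000):
    """Largest width K <= K_max whose drift is still nonnegative (0 if none)."""
    k = np.arange(1, K_max + 1)
    drifts = alpha * np.cumsum(beta / (k + beta - 1.0)) - k
    nonneg = np.nonzero(drifts >= 0)[0]
    return int(nonneg[-1] + 1) if nonneg.size else 0

# Illustrative values used in Figure 2: alpha = 3, beta = 1.
B = last_nonnegative_drift(3.0, 1.0)
```

For $\alpha = 3$, $\beta = 1$ the drift at width 8 is about 0.15 and at width 9 about -0.51, which agrees with the Figure 2 remark that the drift goes negative for widths greater than eight.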

3.3. The CIBP as a prior on the structure of an infinite belief network.
The CIBP can be used as a prior on the sequence
$Z^{(0)}, Z^{(1)}, Z^{(2)}, \ldots$ from Section 2, to allow an infinite
sequence of infinitely-wide hidden layers. As before, there are $K^{(0)}$
visible units. The edges between the first hidden layer and the visible
layer are drawn according to the restaurant metaphor. This yields a finite
number of units in the first hidden layer, denoted $K^{(1)}$ as before.
These units are now treated as the visible units in another IBP-based
network. While this recurses infinitely deep, only a finite number of
units are ancestors of the visible units. Figure 3 shows several samples
from the prior for different parameterizations. Only connected units are
shown in the figure.
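Drawing a network structure from this prior amounts to chaining IBP draws, with the dishes of one restaurant acting as the customers of the next. The following self-contained sketch assumes NumPy; the function names and the `max_layers` safety cap are hypothetical and not part of the paper.

```python
import numpy as np

def sample_ibp(num_rows, alpha, beta, rng):
    # Two-parameter IBP draw; returns only the nonzero columns.
    Z = np.zeros((num_rows, 0), dtype=int)
    for j in range(1, num_rows + 1):
        eta = Z[: j - 1].sum(axis=0)  # popularity of existing dishes
        Z[j - 1] = rng.random(Z.shape[1]) < eta / (j + beta - 1.0)
        new = rng.poisson(alpha * beta / (j + beta - 1.0))
        block = np.zeros((num_rows, new), dtype=int)
        block[j - 1] = 1              # new dishes tasted by customer j
        Z = np.hstack([Z, block])
    return Z

def sample_cibp(num_visible, alpha, beta, rng, max_layers=1000):
    """Return a list of edge matrices, one per pair of adjacent layers.

    Each matrix has one row per unit in the lower layer and one column
    per unit in the layer above. max_layers is a hypothetical cap.
    """
    matrices = []
    width = num_visible
    while width > 0 and len(matrices) < max_layers:
        Z = sample_ibp(width, alpha, beta, rng)
        if Z.shape[1] == 0:           # no dishes tasted: recursion ends
            break
        matrices.append(Z)
        width = Z.shape[1]            # dishes become the next customers
    return matrices

# Hypothetical usage: five visible units, alpha = 1, beta = 1.
structure = sample_cibp(5, 1.0, 1.0, np.random.default_rng(1))
```

Only units that are ancestors of the visible units are ever instantiated, which mirrors the observation that the rest of the infinite network can be ignored.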

The parameters $\alpha$ and $\beta$ govern the expected width and sparsity
of the network at each level. The expected in-degree of each unit (number
of parents) is $\alpha$ and the expected out-degree (number of children) is
$K / \sum_{k=1}^{K} \frac{\beta}{k + \beta - 1}$, for K units used in the
layer below. For clarity, we have presented the CIBP results with $\alpha$
and $\beta$ fixed at all depths; however, this may be overly restrictive.
For example, in an image recognition problem we would not expect the
sparsity of edges mapping low-level features to pixels to be the same as
that for high-level features to low-level features. To address this, we
allow $\alpha$ and $\beta$ to vary with depth, writing $\alpha^{(m)}$ and
$\beta^{(m)}$. The CIBP terminates with probability one as long as there
exists some finite upper bound for $\alpha^{(m)}$ and $\beta^{(m)}$ for
all m.

3.4. Priors on other parameters. Other parameters in the model also
require prior distributions and we use these priors to tie parameters
together according to layer. We assume that the weights in layer m are
drawn independently from Gaussian distributions with mean $\mu_w^{(m)}$
and precision $\rho_w^{(m)}$. We assume a similar layer-wise prior for
biases with parameters $\mu_\gamma^{(m)}$ and $\rho_\gamma^{(m)}$. We use
layer-wise gamma priors on the $\nu_k^{(m)}$, with parameters $a^{(m)}$
and $b^{(m)}$. We tie these prior parameters together with global
normal-gamma hyperpriors for the weight and bias parameters, and gamma
hyperpriors for the precision parameters.

4. Inference. We have so far described a prior on belief network
structure and parameters, along with likelihood functions for unit
activation. The inference task in this model is to find the posterior
distribution over the structure and the parameters of the network, having
seen N $K^{(0)}$-dimensional vectors
$\{x_n \in (-1,1)^{K^{(0)}}\}_{n=1}^{N}$. This posterior distribution is
complex, so we use Markov chain Monte Carlo (MCMC) to draw samples from
$p(\{Z^{(m)}, W^{(m)}\}_{m=1}^{\infty}, \{\gamma^{(m)}, \nu^{(m)}\}_{m=0}^{\infty}, \{x_n\}_{n=1}^{N})$,
which, for fixed $\{x_n\}_{n=1}^{N}$, is proportional to the posterior
distribution. This joint distribution requires marginalizing over the
states of the hidden units that led to each of the N observations. The
values of these hidden units are denoted
$\{\{u_n^{(m)}\}_{m=1}^{\infty}\}_{n=1}^{N}$, and we augment the Markov
chain to include these as well.

In general, one would not expect that a distribution on infinite networks
would yield tractable inference. However, in our construction, conditioned
on the sequence $Z^{(1)}, Z^{(2)}, \ldots$, almost all of the infinite
number of units are independent. Due to this independence, they trivially
marginalize out of the model's joint distribution and we can restrict
inference only to those units that are ancestors of the visible units. Of
course, since this trivial marginalization only arises from the $Z^{(m)}$
matrices, we must also have a distribution on infinite binary matrices that
allows exact marginalization of all the uninstantiated edges. The row-wise
and column-wise exchangeability properties of the IBP are what allow the
use of infinite matrices. The bottom-up conditional structure of the CIBP
allows an infinite number of these matrices.

To simplify notation, we will use $\Omega$ for the aggregated state of the
model variables, i.e.
$\Omega = (\{Z^{(m)}, W^{(m)}, \{u_n^{(m)}\}_{n=1}^{N}\}_{m=1}^{\infty}, \{\gamma^{(m)}, \nu^{(m)}\}_{m=0}^{\infty}, \{x_n\}_{n=1}^{N})$.



Given the hyperparameters, we can then write the joint distribution as 



Although this distribution involves several infinite sets, it is possible
to sample from the relevant parts of the posterior. We do this by MCMC,
updating part of the model, while conditioning on the rest. In particular,
conditioning on the binary matrices $\{Z^{(m)}\}_{m=1}^{\infty}$, which
define the structure of the network, inference becomes exactly as it would
be in a finite belief network.

4.1. Sampling from the hidden unit states. Since we cannot easily inte- 
grate out the hidden units, it is necessary to explicitly represent them and 
sample from them as part of the Markov chain. As we are conditioning on 
the network structure, it is only necessary to sample the units that are an- 
cestors of the visible units. Frey [1997] proposed a slice sampling scheme for 
the hidden unit states but we have been more successful with a specialized 
independence-chain variant of multiple-try Metropolis-Hastings [Liu et al., 
2000]. Our method proposes several (~5) possible new unit states from the
activation distribution and selects from among them (or rejects them all) 
according to the likelihood imposed by its children. As this operation can 
be executed in parallel by tools such as Matlab, we have seen significantly 
better mixing performance by wall-clock time than the slice sampler. 

4.2. Sampling from the weights and biases. Given that a directed edge 
exists, we sample the posterior distribution over its weight. Conditioning on 






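the rest of the model, the NLGBN results in a convenient Gaussian form
for the distribution on weights so that we can Gibbs update them using a
Gaussian with parameters

(4) $\mu^{w\text{-post}}_{m,k,k'} = \frac{\rho_w^{(m)}\mu_w^{(m)} + \nu_k^{(m-1)} \sum_n u^{(m)}_{n,k'}\left(\sigma^{-1}(u^{(m-1)}_{n,k}) - \xi^{(m)}_{n,k,k'}\right)}{\rho_w^{(m)} + \nu_k^{(m-1)} \sum_n (u^{(m)}_{n,k'})^2}$

(5) $\rho^{w\text{-post}}_{m,k,k'} = \rho_w^{(m)} + \nu_k^{(m-1)} \sum_n (u^{(m)}_{n,k'})^2$,

where

(6) $\xi^{(m)}_{n,k,k'} = \gamma_k^{(m-1)} + \sum_{k'' \neq k'} Z^{(m)}_{k,k''} W^{(m)}_{k,k''} u^{(m)}_{n,k''}$.

The bias $\gamma_k^{(m)}$ can be similarly sampled from a Gaussian
distribution with parameters

(7) $\mu^{\gamma\text{-post}}_{m,k} = \frac{\rho_\gamma^{(m)}\mu_\gamma^{(m)} + \nu_k^{(m)} \sum_{n=1}^{N}\left(\sigma^{-1}(u^{(m)}_{n,k}) - \chi^{(m)}_{n,k}\right)}{\rho_\gamma^{(m)} + N\nu_k^{(m)}}$

(8) $\rho^{\gamma\text{-post}}_{m,k} = \rho_\gamma^{(m)} + N\nu_k^{(m)}$,

where

(9) $\chi^{(m)}_{n,k} = \sum_{k'=1}^{K^{(m+1)}} Z^{(m+1)}_{k,k'} W^{(m+1)}_{k,k'} u^{(m+1)}_{n,k'}$.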
4.3. Sampling from the activation variances. We use the NLGBN model
to gain the ability to change the mode of unit behaviors between discrete
and continuous representations. This corresponds to sampling from the
posterior distributions over the $\nu_k^{(m)}$. With a conjugate prior, the
new value can be sampled from a gamma distribution with parameters

(10) $a^{\nu\text{-post}}_{m,k} = a^{(m)} + N/2$

(11) $b^{\nu\text{-post}}_{m,k} = b^{(m)} + \frac{1}{2}\sum_{n=1}^{N}\left(\sigma^{-1}(u^{(m)}_{n,k}) - y^{(m)}_{n,k}\right)^2$.
4.4. Sampling from the structure. A model for infinite belief networks is
only useful if it is possible to perform inference. The appeal of the CIBP
prior is that it enables construction of a tractable Markov chain for
inference. To do this sampling, we must add and remove edges from the
network, consistent with the posterior equilibrium distribution. When
adding a layer, we must sample additional layerwise model components. When
introducing an edge, we must draw a weight for it from the prior. If this
new edge introduces a previously-unseen hidden unit, we must draw a bias
for it and also draw its deeper cascading connections from the prior.
Finally, we must sample the N new hidden unit states from any new unit we
introduce.

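We iterate over each layer that connects to the visible units. Within each
layer $m \geq 0$, we iterate over the connected units. Sampling the edges
incident to the kth unit in layer m has two phases. First, we iterate over
each connected unit in layer m + 1, indexed by k'. We calculate
$\eta^{(m)}_{-k,k'}$, which is the number of nonzero entries in the k'th
column of $Z^{(m+1)}$, excluding any entry in the kth row. If
$\eta^{(m)}_{-k,k'}$ is zero, we call the unit k' a singleton parent, to be
dealt with in the second phase. If $\eta^{(m)}_{-k,k'}$ is nonzero, we
introduce (or keep) the edge from unit $u^{(m+1)}_{k'}$ to $u^{(m)}_k$
with Bernoulli probability

$p(Z^{(m+1)}_{k,k'} = 1 \mid \Omega \setminus Z^{(m+1)}_{k,k'}) = \frac{1}{\mathcal{Z}} \left(\frac{\eta^{(m)}_{-k,k'}}{K^{(m)} + \beta^{(m)} - 1}\right) \prod_{n=1}^{N} p(u^{(m)}_{n,k} \mid Z^{(m+1)}_{k,k'} = 1, \Omega \setminus Z^{(m+1)}_{k,k'})$

$p(Z^{(m+1)}_{k,k'} = 0 \mid \Omega \setminus Z^{(m+1)}_{k,k'}) = \frac{1}{\mathcal{Z}} \left(1 - \frac{\eta^{(m)}_{-k,k'}}{K^{(m)} + \beta^{(m)} - 1}\right) \prod_{n=1}^{N} p(u^{(m)}_{n,k} \mid Z^{(m+1)}_{k,k'} = 0, \Omega \setminus Z^{(m+1)}_{k,k'})$,

where $\mathcal{Z}$ is the appropriate normalization constant and
$\Omega \setminus Z^{(m+1)}_{k,k'}$ denotes the aggregated model state with
that entry excluded.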

In the second phase, we consider deleting connections to singleton parents
of unit k, or adding new singleton parents. We do this via a
Metropolis-Hastings operator using a birth/death process. If there are
currently $K_\circ$ singleton parents, then with probability 1/2 we propose
adding a new one by drawing it recursively from deeper layers, as above. We
accept the proposal to insert a connection to this new parent unit with M-H
acceptance ratio

$a_{\text{mh-insert}} = \frac{\alpha^{(m)}\beta^{(m)}}{(K_\circ + 1)(\beta^{(m)} + K^{(m)} - 1)} \prod_{n=1}^{N} \frac{p(u^{(m)}_{n,k} \mid Z^{(m+1)}_{k,j} = 1, \Omega \setminus Z^{(m+1)}_{k,j})}{p(u^{(m)}_{n,k} \mid Z^{(m+1)}_{k,j} = 0, \Omega \setminus Z^{(m+1)}_{k,j})}$.

If we do not propose to insert a unit and $K_\circ > 0$, then with
probability 1/2 we select uniformly from among the singleton parents of
unit k and propose removing the connection to it. We accept the proposal to
remove the jth one with M-H acceptance ratio

$a_{\text{mh-remove}} = \frac{K_\circ(\beta^{(m)} + K^{(m)} - 1)}{\alpha^{(m)}\beta^{(m)}} \prod_{n=1}^{N} \frac{p(u^{(m)}_{n,k} \mid Z^{(m+1)}_{k,j} = 0, \Omega \setminus Z^{(m+1)}_{k,j})}{p(u^{(m)}_{n,k} \mid Z^{(m+1)}_{k,j} = 1, \Omega \setminus Z^{(m+1)}_{k,j})}$.

After these phases, chains of units that are not ancestors of the visible
units can be discarded. Notably, this birth/death operator samples from the
IBP posterior with a non-truncated equilibrium distribution, even without
conjugacy. Unlike the stick-breaking approach of Teh et al. [2007], it
allows use of the two-parameter IBP, which is important to this model.

4.5. Sampling From CIBP Hyperparameters. When applying this model
to data, we rarely have a good a priori idea of what the appropriate IBP
parameters should be. These control the width and sparsity of the network
and while we might have good initial guesses for the lowest layer, in
general we would like to infer $\{\alpha^{(m)}, \beta^{(m)}\}$ as part of
the larger inference procedure. This is straightforward in the
fully-Bayesian MCMC procedure we have constructed, and it does not differ
markedly from hyperparameter inference in standard IBP models when
conditioning on $Z^{(m)}$. As in some other nonparametric models (e.g.
Tokdar [2006] and Rasmussen and Williams [2006]), we have found that
light-tailed priors on the hyperparameters help ensure that the model
stays in reasonable states.

5. Reconstructing Images. We applied the model and MCMC-based 
inference procedure to three image data sets: the Olivetti faces, the MNIST 
digits and the Frey faces. We used these data to analyze the structures and 
sparsity that arise in the model posterior. To get a sense of the utility of 
the model, we constructed a missing-data problem using held-out images 
from each set. We removed the bottom halves of the test images and asked 
the model to reconstruct the missing data, conditioned on the top half. The 
prediction itself was done by integrating out the parameters and structure 
via MCMC. 

Olivetti Faces. The Olivetti faces data [Samaria and Harter, 1994] consists
of 400 grayscale images, each $64 \times 64$ pixels, of the faces of 40
distinct subjects. We divided these into 350 test data and 50 training
data, selected randomly. This data set is an appealing test because it has
few examples, but many dimensions. Figure 4a shows six bottom-half test set
reconstructions on the right, compared to the ground truth on the left.
Figure 4b shows a subset of sixty weight patterns from a posterior sample
of the structure, with black indicating that no edge is present from that
hidden unit to the visible unit (pixel). The algorithm is clearly assigning
hidden units to specific and interpretable features, such as mouth shapes,
the presence of glasses or facial hair, and skin tone, while largely
ignoring the rest of the image. Figure 4c shows ten pure fantasies from the
model, easily generated in a directed acyclic belief network. Figure 4d
shows the result of activating individual units in the second hidden layer,
while keeping the rest unactivated, and propagating the activations down to
the visible pixels. This provides an idea of the image space spanned by the
principal components of these deeper units. A typical posterior network had
three hidden layers, with about 70 units in each layer.

Fig 4: Olivetti faces a) Test images on the left, with reconstructed bottom
halves on the right, b) Sixty features learned in the bottom layer, where
black shows absence of an edge. Note the learning of sparse features
corresponding to specific facial structures such as mouth shapes, noses and
eyebrows, c) Raw predictive fantasies, d) Feature activations from
individual units in the second hidden layer.

Fig 5: MNIST Digits a) Eight pairs of test reconstructions, with the bottom
half of each digit missing. The truth is the left image in each pair,
b) 120 features learned in the bottom layer, where black indicates that no
edge exists, c) Activations in pixel space resulting from activating
individual units in the deepest layer, d) Samples from the posterior of
$Z^{(0)}$, $Z^{(1)}$ and $Z^{(2)}$ (transposed).

MNIST Digit Data. We used a subset of the MNIST handwritten digit
data [LeCun et al., 1998] for training, consisting of 50 examples
($28 \times 28$ pixels) of each of the ten digits. We used an additional
ten examples of each digit for test data. In this case, the lower-level
features are extremely sparse, as shown in Figure 5b, and the deeper units
are simply activating sets of blobs at the pixel level. This is shown also
by activating individual units at the deepest layer, as shown in Figure 5c.
Test reconstructions are shown in Figure 5a. A typical network had three
hidden layers, with approximately 120 units in the first, 100 in the second
and 70 in the third. The binary matrices $Z^{(0)}$, $Z^{(1)}$, and
$Z^{(2)}$ are shown in Figure 5d.



LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS 15 






Fig 6: Frey faces a) Eight pairs of test reconstructions, with the bottom half of 
each face missing. The truth is the left image in each pair, b) 260 features learned 
in the bottom layer, where black indicates that no edge exists. 



Frey Faces. The Frey faces data^1 are 1965 grayscale video frames, each
$20 \times 28$ pixels, of a single face with different expressions. We
divided these into 1865 training data and 100 test data, selected randomly.
While typical posterior samples of the network again used three hidden
layers, the networks for these data tended to be much wider and more
densely connected. In the bottom layer, as shown in Figure 6b, a typical
hidden unit would connect to many pixels. We attribute this to global
correlation effects from every image only coming from a single person.
Typical widths were 260 units in the first hidden layer, 120 units in the
second hidden layer, and 35 units in the deepest layer.

In all three experiments, our MCMC sampler appeared to mix well and
began to find reasonable reconstructions after a few hours of CPU time.
It is interesting to note that the learned sparse connection patterns
varied from local (MNIST), through intermediate (Olivetti) to global
(Frey), despite identical hyperpriors on the IBP parameters. This strongly
suggests that flexible priors on structures are needed to adequately
capture the statistics of different data sets.



^1 http://www.cs.toronto.edu/~roweis/data.html
6. Discussion. This paper unites two areas of research, nonparametric
Bayesian methods and deep belief networks, to provide a novel
nonparametric perspective on the general problem of learning the structure
of directed deep belief networks with hidden units.

We addressed three outstanding issues that surround deep belief networks.
First, we allowed the units to have different operating regimes and
inferred appropriate local representations that range from discrete binary
behavior to nonlinear continuous behavior. Second, we provided a way for a
deep belief network to contain an arbitrary number of hidden units arranged
in an arbitrary number of layers. This structure enables the hidden units
to have nontrivial joint distributions. Third, we presented a method for
inferring the appropriate directed graph structure of a deep belief
network. To address these issues, we introduced a novel cascading extension
to the Indian buffet process, the cascading Indian buffet process (CIBP),
and proved convergence properties that make it useful as a Bayesian prior
distribution for a sequence of infinite binary matrices.

This work can be viewed as an infinite multilayer generalization of the 
density network [MacKay, 1995], and also as part of a more general litera- 
ture of learning structure in probabilistic networks. With a few exceptions 
(e.g., Beal and Ghahramani [2006], Elidan et al. [2000], Friedman [1998], 
Ramachandran and Mooney [1998]), most previous work on learning the 
structure of belief networks has focused on the case where all units are
observed [Buntine, 1991, Friedman and Koller, 2003, Heckerman et al., 1995,
Koivisto and Sood, 2004]. The framework presented in this paper not only
allows for an unbounded number of hidden units, but fundamentally couples 
the model for the number and behavior of the units with the nonparametric 
model for the structure of the infinite directed graph. Rather than compar- 
ing structures by evaluating marginal likelihoods of different models, our 
nonparametric approach makes it possible to do inference in a single model 
with an unbounded number of units and layers, thereby learning effective 
model complexity. This approach is more appealing both computationally 
and philosophically. 

There are a variety of future research paths that can potentially stem 
from the model we have presented here. As we have presented it, we do 
not expect that our MCMC-based unsupervised inference scheme will be 
competitive on supervised tasks with extensively-tuned discriminative mod- 
els based on variants of maximum-likelihood learning. However, we believe 
that this model can inform choices for network depth, layer size and edge 
structure in such networks and will inspire further research into flexible 
nonparametric network models. 
APPENDIX A: PROOF OF GENERAL CIBP TERMINATION

In the main paper, we discussed that the cascading Indian buffet process
for fixed and finite $\alpha$ and $\beta$ eventually reaches a restaurant
in which the customers choose no dishes. Every deeper restaurant also has
no dishes. Here we show a more general result, for IBP parameters that vary
with depth, written $\alpha^{(m)}$ and $\beta^{(m)}$.

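Let there be an inhomogeneous Markov chain $\mathcal{M}$ with state space
$\mathbb{N}$. Let m index time and let the state at time m be denoted
$K^{(m)}$. The initial state $K^{(0)}$ is finite. The probability mass
function describing the transition distribution for $\mathcal{M}$ at time
m is given by

(12) $p(K^{(m+1)} = k \mid K^{(m)}, \alpha^{(m)}, \beta^{(m)}) = \frac{1}{k!} \exp\{-\lambda(K^{(m)}; \alpha^{(m)}, \beta^{(m)})\}\, \lambda(K^{(m)}; \alpha^{(m)}, \beta^{(m)})^k$,

with $\lambda(K; \alpha, \beta) = \alpha \sum_{k'=1}^{K} \frac{\beta}{k' + \beta - 1}$.

Theorem A.1. If there exists some $\bar{\alpha} < \infty$ and
$\bar{\beta} < \infty$ such that $\forall m$,
$\alpha^{(m)} < \bar{\alpha}$ and $\beta^{(m)} < \bar{\beta}$, then
$\lim_{m \to \infty} p(K^{(m)} = 0) = 1$.

Proof. Let $\mathbb{N}^+$ be the positive integers. The $\mathbb{N}^+$ are
a communicating class for the Markov chain (it is possible to eventually
reach any member of the class from any other member), and every state in
$\mathbb{N}^+$ has a nonzero probability of transitioning directly to the
absorbing state, i.e. $p(K^{(m+1)} = 0 \mid K^{(m)}) > 0$,
$\forall K^{(m)}$. If, conditioned on nonabsorption, the Markov chain has a
stationary distribution (is quasi-stationary), then it reaches absorption
in finite time with probability one. Heuristically, this is the requirement
that, conditioned on having not yet reached a restaurant with no dishes,
the number of dishes in deeper restaurants will not explode.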

The quasi-stationary condition can be met by showing that $\mathbb{N}^+$
are positive recurrent states. We use the Foster-Lyapunov stability
criterion (FLSC) to show positive-recurrency of $\mathbb{N}^+$. The FLSC is
met if there exists some function
$\mathcal{L}(\cdot) : \mathbb{N}^+ \to \mathbb{R}^+$ such that for some
$\epsilon > 0$ and some finite $B \in \mathbb{N}^+$,

(13) $\sum_{k=1}^{\infty} p(K^{(m+1)} = k \mid K^{(m)})\,(\mathcal{L}(k) - \mathcal{L}(K^{(m)})) < -\epsilon$ for $K^{(m)} > B$

(14) $\sum_{k=1}^{\infty} p(K^{(m+1)} = k \mid K^{(m)})\,\mathcal{L}(k) < \infty$ for $K^{(m)} \leq B$.

For Lyapunov function $\mathcal{L}(k) = k$, the first condition is
equivalent to

(15) $\alpha^{(m)} \sum_{k=1}^{K^{(m)}} \frac{\beta^{(m)}}{k + \beta^{(m)} - 1} - K^{(m)} < -\epsilon$.

Since $\alpha^{(m)} < \bar{\alpha}$ and $\beta^{(m)} < \bar{\beta}$ for
all m, we have

(16) $\alpha^{(m)} \sum_{k=1}^{K^{(m)}} \frac{\beta^{(m)}}{k + \beta^{(m)} - 1} < \bar{\alpha} \sum_{k=1}^{K^{(m)}} \frac{\bar{\beta}}{k + \bar{\beta} - 1}$

for all $K^{(m)} > 0$. Thus, the first condition is satisfied for any B
that satisfies the condition for $\bar{\alpha}$ and $\bar{\beta}$. That
such a B exists for any finite $\bar{\alpha}$ and $\bar{\beta}$ can be seen
by the equivalent condition

(17) $\bar{\alpha} \sum_{k=1}^{K^{(m)}} \frac{\bar{\beta}}{k + \bar{\beta} - 1} - K^{(m)} < -\epsilon$ for $K^{(m)} > B$.

As the first term is roughly logarithmic in $K^{(m)}$, there exists some
finite B that satisfies this inequality. The second FLSC condition is
trivially satisfied by the observation that Poisson distributions have a
finite mean. □